OC-IA-P7 Neural Network training

This notebook aims to locally train a neural network for sentiment analysis, before deploying it on Azure.

We'll compare several text-normalization methods, embeddings, and model architectures:

Preprocess data

Extract data and get a shuffled, balanced sample of 10,000 tweets
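A minimal sketch of drawing such a sample, assuming the tweets sit in a pandas DataFrame with a `sentiment` column (the `balanced_sample` helper and the toy frame below are illustrative, not the notebook's actual code):

```python
import pandas as pd

def balanced_sample(df, target="sentiment", n=10_000, seed=42):
    """Return a shuffled sample of n rows with equal counts per class."""
    per_class = n // df[target].nunique()
    sample = (
        df.groupby(target, group_keys=False)
          .apply(lambda g: g.sample(per_class, random_state=seed))
    )
    # Shuffle so classes are interleaved rather than grouped
    return sample.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy unbalanced frame standing in for the real tweet dataset
df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "sentiment": [0] * 70 + [4] * 30,
})
sample = balanced_sample(df, n=40)
print(sample["sentiment"].value_counts())
```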

Split dataset

Normalize data

Re-label sentiment feature (target)

Since our positive class is the negative/unhappy sentiment, we remap the "sentiment" column to the expected values:
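A sketch of that remapping, assuming the Sentiment140 encoding (0 = negative, 4 = positive); the toy frame is illustrative:

```python
import pandas as pd

# Hypothetical frame using the Sentiment140 encoding (0 = negative, 4 = positive)
df = pd.DataFrame({"sentiment": [0, 4, 0, 4]})

# Negative/unhappy tweets become our positive class (1), happy tweets become 0
df["sentiment"] = (df["sentiment"] == 0).astype(int)
print(df["sentiment"].tolist())  # [1, 0, 1, 0]
```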

Clean text

Text must be cleaned before embedding. We'll remove:

Then we'll apply stemming or lemmatization to enhance model performance, and compare both methods through the model results. Here is an example of each preprocessing method:

test_string = "@mimi2000 We, finally!: went to the shopping) 12centers! 34"
print('Test string:')
print(test_string)
print('\nPreprocessed string with lemmatization:')
print(DataPreprocessor(normalization='lem')._normalize_text(test_string))
print('\nPreprocessed string with stemming:')
print(DataPreprocessor(normalization='stem')._normalize_text(test_string))
print('\nPreprocessed string with no stemming/lemmatization:')
print(DataPreprocessor(normalization='keep')._normalize_text(test_string))
Test string:
@mimi2000 We, finally!: went to the shopping) 12centers! 34

Preprocessed string with lemmatization:
Loading vectors for word2vec model, please wait...
Vectors loaded.
we finally go shopping center

Preprocessed string with stemming:
Loading vectors for word2vec model, please wait...
Vectors loaded.
we final went shop center

Preprocessed string with no stemming/lemmatization:
Loading vectors for word2vec model, please wait...
Vectors loaded.
We finally went shopping centers

Embedding

For our first try, we'll use a pre-trained English Word2vec model from Gensim.

To embed whole sentences, we'll average the vectors of each word.
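A minimal sketch of this averaging, using a toy vocabulary in place of the full Gensim KeyedVectors (the `embed_sentence` helper and the three-dimensional vectors are illustrative):

```python
import numpy as np

def embed_sentence(words, vectors, dim=3):
    """Average the vectors of the words found in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:  # no known word: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy vocabulary standing in for the pre-trained Word2vec vectors
vectors = {
    "we":       np.array([1.0, 0.0, 0.0]),
    "go":       np.array([0.0, 1.0, 0.0]),
    "shopping": np.array([0.0, 0.0, 1.0]),
}
# Out-of-vocabulary words are simply skipped
print(embed_sentence(["we", "go", "shopping", "oov"], vectors))
```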

Our function is ready to preprocess each dataset:

Train models

Now that we have cleaned the data, we can create the model:

Baseline classifier: logistic regression
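A sketch of such a baseline on synthetic data standing in for the averaged sentence vectors (the feature matrix and split below are illustrative, not the notebook's actual data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
# Stand-ins for the averaged word vectors (X) and relabeled targets (y)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Fit on the first 150 rows, evaluate recall on the held-out 50
clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
rec = recall_score(y[150:], clf.predict(X[150:]))
print("recall:", rec)
```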

Simple neural network

Click here to go to TensorBoard

The recall oscillates a lot. Maybe tuning the batch size will help? Let's train the model with different batch sizes, then, for each resulting series of val_recall values, compute its standard deviation:
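The comparison can be sketched as below; the val_recall histories are hypothetical placeholders (in the notebook they would come from each `model.fit(...).history["val_recall"]`):

```python
import numpy as np

# Hypothetical val_recall series, one per batch size tried
histories = {
    32:  [0.70, 0.78, 0.65, 0.80, 0.68],
    128: [0.71, 0.76, 0.69, 0.77, 0.70],
    512: [0.72, 0.74, 0.71, 0.75, 0.72],
}

# A lower standard deviation means a more stable validation recall
for batch_size, recalls in histories.items():
    print(f"batch_size={batch_size}: std(val_recall)={np.std(recalls):.4f}")
```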

Batch size does not seem to significantly reduce the val_recall oscillations. Maybe another activation function will help?

We notice that the model converges much faster with the SELU activation function, but it still oscillates a lot.

We used lemmatization and Word2vec embedding. Let's compare with other normalizing and embedding methods.

Find best preprocessing methods

Lemmatization with GloVe embedding seems to be the best combination, so we'll use it for the next steps. That said, there is not a huge difference between the combinations.

Tuning hyperparameters

Since the recall is rather wobbly, we won't monitor it for tuning hyperparameters; we'll monitor val_loss instead.

Now that the tuner has found good parameters, we can use them in our model:

This is a little better than our baseline recall on a simple logistic regression (72%).

LSTM network

Preprocess data

We can try to improve our model by using an LSTM. To do so, we can't use the vectorized dataset we've had until now: instead, we need to feed sentences in as sequences.

So let's first reprocess our dataset only to get cleaned words:
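The reshaping from cleaned-word lists to the fixed-length integer sequences an LSTM expects can be sketched as below; this is a pure-Python stand-in for Keras' `Tokenizer` and `pad_sequences`, and the helper names are illustrative:

```python
def build_vocab(sentences):
    """Map each word to an integer id; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for words in sentences:
        for w in words:
            vocab.setdefault(w, len(vocab))
    return vocab

def to_padded_sequences(sentences, vocab, maxlen):
    """Convert word lists to id lists, truncated/right-padded to maxlen."""
    seqs = []
    for words in sentences:
        ids = [vocab[w] for w in words][:maxlen]
        seqs.append(ids + [0] * (maxlen - len(ids)))
    return seqs

sentences = [["we", "go", "shopping"], ["we", "stay", "home", "today"]]
vocab = build_vocab(sentences)
print(to_padded_sequences(sentences, vocab, maxlen=5))
# [[1, 2, 3, 0, 0], [1, 4, 5, 6, 0]]
```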

As the model is much slower to train, we'll work on a smaller sample.

First training

The model converges fast, so let's train it again with epochs=1:

Tuning hyperparameters

Let's try to improve this result by tuning hyperparameters:

This model is not much better than the previous one, and its training on a much smaller dataset (15% of the data) takes much longer (about 6 times).

BERT model

We'll use the standard BERT model with 12 hidden layers, 768 neurons per layer and 12 attention heads.

The text needs to be preprocessed and encoded before being fed into the BERT neural network, so we'll also use the appropriate preprocessor.
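To illustrate the format that preprocessor produces, here is a toy sketch of the three input tensors BERT expects, built by hand from a hypothetical vocabulary (in the notebook this is done by the matching preprocessing model, not this code):

```python
# Toy vocabulary; 101 and 102 are BERT's conventional [CLS]/[SEP] ids
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "we": 201, "went": 202, "shopping": 203}

def encode(tokens, max_len=8):
    """Build the three fixed-length inputs the BERT encoder expects."""
    ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    pad = max_len - len(ids)
    return {
        "input_word_ids": ids + [0] * pad,          # token ids, padded
        "input_mask": [1] * len(ids) + [0] * pad,   # 1 = real token, 0 = pad
        "input_type_ids": [0] * max_len,            # single sentence: all zeros
    }

print(encode(["we", "went", "shopping"]))
```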

Preprocess data

Since BERT has its own preprocessing and can take punctuation, conjugated forms, etc. into account, we clean the text in the simplest way, just removing Twitter handles and URLs (no stemming/lemmatization).
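A minimal sketch of that light cleaning with regular expressions (the `simple_clean` name is illustrative):

```python
import re

def simple_clean(text):
    """Remove Twitter handles and URLs, keeping everything else for BERT."""
    text = re.sub(r"@\w+", "", text)                   # @handles
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

print(simple_clean("@mimi2000 We went shopping! http://t.co/abc123"))
# We went shopping!
```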

Let's preprocess example sentences:

Now let's test what our BERT model can do with our preprocessed sentences:

For each sentence:

First training

Now we can build the classification model:

As another baseline, below is a pretrained model specialized in binary sentiment analysis (source):

Let's apply it to our test set: